Abstract

Measures of association play a role in selecting 2x2 tables exhibiting strong dependence in high-dimensional
binary data. Several measures are in use; they differ on specific tables and in their dependence on the margins.

We study a 2-dimensional group of margin transformations on the 3-dimensional manifold T of all
2x2 probability tables. The margin transformations allow introducing natural coordinates that identify 
T with the real 3-space such that the x-axis corresponds to log(sqrt(odds-ratio)) and margins vary on 
planes x=const. We use these coordinates to visualise and compare measures of association with respect 
to their dependence on the margins given the odds-ratio, their limit behaviour when cells approach zero 
and their weighting properties. 

We propose a novel measure of association in which tables with single small entries are up-weighted
but those with skewed margins are down-weighted according to the relative entropy among the tables of
the same odds-ratio.



Keywords: two by two probability tables, measures of association, entropy 
1 Introduction

2x2 tables of binary markers with random margins are intriguing in several respects: First, there is 
a confusing plethora of measures of association in 2x2 tables with random margins that are used in 
statistical practice. Their relative merit is unclear. Some of them were developed for 2x2 tables with 
fixed margins and then extended to the case considered here. Measures typically agree in the ordering 
by strength of association on 2x2 tables that have diagonal symmetry and in case of independence. But 
they markedly differ in asymmetric tables or in tables which are "far from independence". We develop 
a unified framework to analyse, visualise and compare measures of association in 2x2 probability tables 
especially with respect to their dependence on the margins. 

Second, 2x2 tables "far from independence" may approximate logical forms like logical equivalence 
(one diagonal is zero) or implication (one entry zero). The task of selecting particularly interesting and 
informative tables among a large number of tables is often encountered in the analysis of data consisting 
of high dimensional binary patterns (e.g. linkage disequilibrium of SNPs, patterns of aberration at various 
DNA loci, patterns of protein expression etc.). We suggest a principled approach for picking tables which 
approximate logical relations. This approach relies on an entropy-based weighting of tables and aims to 
improve existing measures often used in Genetical Statistics. 

Defining and justifying measures and estimating them from empirical data are radically different
tasks. We have investigated methods of estimating measures of association in a separate paper
(Scholz & Hasenclever 2010). Here we deal exclusively with abstract 2x2 probability models and their
mathematical structure.



2 Mathematical structure of 2x2 probability models 

2x2 tables of binary markers with random margins can be considered as tetranomial distributions with a 
symmetry structure. Symmetry of 2x2 tables can be described by the dihedral group D4 generated by the 
transposition of the binary markers (matrix transposition) and transposition of their values (transposition 
of columns or rows). 

We consider the manifold $T$ of all non-degenerate tetranomial probability models, which we write in
two by two layout: $T$ consists of all two by two matrices $t$ with entries $p_{ij} \in \mathbb{R}$
($i, j \in \{0,1\}$) subject to the constraints $p_{ij} > 0$, $\sum_{i,j} p_{ij} = 1$. The $p_{ij}$ denote the
probabilities of the corresponding combination of the states of two binary markers $i$ and $j$. In the
following, we abbreviate $\sum_{i=0}^{1} \sum_{j=0}^{1} =: \sum_{i,j}$, $p_{i.} := p_{i0} + p_{i1}$
and $p_{.j} := p_{0j} + p_{1j}$. The margins $p_{i.}$ and $p_{.j}$ give the marginal distributions of the
markers $i$ and $j$ respectively.

In $T$ we have several relevant submanifolds. There is a marked point $m_0$, namely the midpoint
$\begin{pmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{pmatrix}$. There is the 1-dimensional submanifold DS of all tables
with diagonal symmetry of the form $\begin{pmatrix} a & b \\ b & a \end{pmatrix}$. And there is the
2-dimensional submanifold IND of independent tables with $p_{ij} = p_{i.} \cdot p_{.j}$.

By $\bar{T}$ we denote the closure of $T$. The border $\partial T = \bar{T} - T$ consists of tables with at
least one zero: four two-dimensional sides $\{p_{ij} = 0\}$ for any $i, j$; six one-dimensional edges, namely
vanishing rows $\{p_{i.} = 0\}$, vanishing columns $\{p_{.j} = 0\}$ and two vanishing diagonals
$\{p_{00} = p_{11} = 0\}$, $\{p_{01} = p_{10} = 0\}$; as well as four triple-zero vertices $\{p_{ij} = 1\}$.

Manipulating the margins defines an additional structure on $T$. We can multiply rows or columns
with positive numbers and renormalise. Formally, consider the group
$G \cong (\mathbb{R}_+ \times \mathbb{R}_+, \cdot)$ with component-wise multiplication.

For every $(\mu, \nu) \in \mathbb{R}_+ \times \mathbb{R}_+$ we define a map $g(\mu, \nu) : T \to T$,

\[
t = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \mapsto g(\mu,\nu)(t) =
\frac{1}{\mu\nu p_{00} + \mu p_{01} + \nu p_{10} + p_{11}}
\begin{pmatrix} \mu\nu p_{00} & \mu p_{01} \\ \nu p_{10} & p_{11} \end{pmatrix} \qquad (1)
\]

Since $g(\mu,\nu) \circ g(\mu',\nu') = g(\mu \cdot \mu', \nu \cdot \nu')$ and $g(1,1) = \mathrm{id}_T$, this
defines a $G$-group action on $T$.

Lying in the same group orbit defines an equivalence relation on $T$: We say two elements $t_1, t_2 \in T$
are equivalent, $t_1 \sim t_2$, if and only if there are $(\mu,\nu) \in \mathbb{R}_+ \times \mathbb{R}_+$ with
$g(\mu,\nu)(t_1) = t_2$. $G$-orbits are diffeomorphic to $\mathbb{R}_+ \times \mathbb{R}_+$.

A real function $\eta : T \to \mathbb{R}$ is $G$-invariant if $\eta(t) = \eta(g(\mu,\nu)(t))$ for all
$(\mu,\nu) \in \mathbb{R}_+ \times \mathbb{R}_+$.



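Proposition 1 (odds-ratio):

a) The odds-ratio $\lambda : T \to \mathbb{R}_+$;
$t = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \mapsto \lambda(t) = \frac{p_{00}p_{11}}{p_{01}p_{10}}$
is $G$-invariant.

b) The odds-ratio classifies the $G$-orbits. Let $T/G$ be the quotient space of $T$ by the equivalence relation
induced by $G$. $\lambda$ induces a bijective map $\bar\lambda : T/G \to \mathbb{R}_+$.

c) The inverse mapping $\bar\lambda^{-1} : \mathbb{R}_+ \to T/G$ can be described by

\[
l \mapsto \left[ \frac{1}{2(1+\sqrt{l})} \begin{pmatrix} \sqrt{l} & 1 \\ 1 & \sqrt{l} \end{pmatrix} \right]
\]

d) Every $G$-invariant function $\eta : T \to \mathbb{R}$ can be written as a function of $\lambda$, namely
$\eta = (\eta \circ \bar\lambda^{-1}) \circ \lambda$.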



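Proof: a) is easily verified. b) Every equivalence class $[t]$ in $T/G$ has a representative with margins
$\frac{1}{2}$, namely $g(\mu,\nu)(t)$ with $\mu = \sqrt{p_{10}p_{11}/(p_{00}p_{01})}$ and
$\nu = \sqrt{p_{01}p_{11}/(p_{00}p_{10})}$, which has the form given in c). d) is trivial.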
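The group action and the invariance of the odds-ratio can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names `g`, `odds_ratio` and `balanced_representative` are illustrative, and tables are assumed to be stored as nested tuples `((p00, p01), (p10, p11))`.

```python
import math

def odds_ratio(t):
    """Odds-ratio p00*p11 / (p01*p10) of a table ((p00, p01), (p10, p11))."""
    (p00, p01), (p10, p11) = t
    return (p00 * p11) / (p01 * p10)

def g(mu, nu, t):
    """Margin transformation: scale row 0 by mu and column 0 by nu, then renormalise."""
    (p00, p01), (p10, p11) = t
    s = mu * nu * p00 + mu * p01 + nu * p10 + p11
    return ((mu * nu * p00 / s, mu * p01 / s), (nu * p10 / s, p11 / s))

def balanced_representative(t):
    """Member of the G-orbit of t whose four margins all equal 1/2."""
    (p00, p01), (p10, p11) = t
    mu = math.sqrt(p10 * p11 / (p00 * p01))
    nu = math.sqrt(p01 * p11 / (p00 * p10))
    return g(mu, nu, t)

t = ((0.5, 0.2), (0.1, 0.2))       # example table with odds-ratio 5
moved = g(3.0, 0.25, t)            # another table in the same G-orbit
rep = balanced_representative(t)   # diagonally symmetric representative
```

Here `moved` has different margins than `t` but the same odds-ratio, and `rep` has the diagonally symmetric form with entries $\sqrt{l}/(2(1+\sqrt{l}))$ and $1/(2(1+\sqrt{l}))$ for $l = 5$.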



We next define new coordinates on T to make use of this insight. 



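Proposition 2 (Margin transformation coordinates on T highlighting the G-action and its
invariant): The map $\theta : T \to \mathbb{R}^3$,

\[
t = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \mapsto (x, y, z) =
\left( \ln\sqrt{\frac{p_{00}p_{11}}{p_{01}p_{10}}},\ \ln\sqrt{\frac{p_{00}p_{01}}{p_{10}p_{11}}},\
\ln\sqrt{\frac{p_{00}p_{10}}{p_{01}p_{11}}} \right)
\]

is a diffeomorphism. The inverse $\theta^{-1} : \mathbb{R}^3 \to T$ is given by

\[
\theta^{-1}(x,y,z) = \frac{1}{e^{x+y+z} + e^{y} + e^{z} + e^{x}}
\begin{pmatrix} e^{x+y+z} & e^{y} \\ e^{z} & e^{x} \end{pmatrix} \qquad (2)
\]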



In these new coordinates, $x$ corresponds to the logarithm of the square root of the odds-ratio (i.e. half the
log odds-ratio), while $y$ and $z$ determine the $G$-transformation that maps the table to diagonal symmetry.
In addition, the midpoint $m_0$ corresponds to the origin $(0, 0, 0)$. $G$-orbits (odds-ratio = constant)
correspond to planes $\{x\} \times \mathbb{R}^2$. In particular, the submanifold of independent tables IND maps
to $\{0\} \times \mathbb{R}^2$. The tables with diagonal symmetry DS form the line
$\mathbb{R} \times \{0\} \times \{0\}$. Transposing rows and columns of a table is equivalent to the
transformations $y \to -y$ and $z \to -z$, while matrix transposition is equivalent to the transformation
$y \leftrightarrow z$.
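The coordinate map and its inverse are straightforward to implement. The sketch below assumes the coordinates $x = \ln\sqrt{p_{00}p_{11}/(p_{01}p_{10})}$, $y = \ln\sqrt{p_{00}p_{01}/(p_{10}p_{11})}$, $z = \ln\sqrt{p_{00}p_{10}/(p_{01}p_{11})}$; the names `theta` and `theta_inv` are chosen for illustration only.

```python
import math

def theta(t):
    """Margin transformation coordinates (x, y, z) of a table ((p00, p01), (p10, p11))."""
    (p00, p01), (p10, p11) = t
    x = 0.5 * math.log(p00 * p11 / (p01 * p10))  # log of sqrt(odds-ratio)
    y = 0.5 * math.log(p00 * p01 / (p10 * p11))
    z = 0.5 * math.log(p00 * p10 / (p01 * p11))
    return x, y, z

def theta_inv(x, y, z):
    """Probability table belonging to the coordinates (x, y, z)."""
    w00, w01, w10, w11 = math.exp(x + y + z), math.exp(y), math.exp(z), math.exp(x)
    s = w00 + w01 + w10 + w11
    return ((w00 / s, w01 / s), (w10 / s, w11 / s))

midpoint = ((0.25, 0.25), (0.25, 0.25))
independent = ((0.12, 0.18), (0.28, 0.42))   # margins (0.3, 0.7) and (0.4, 0.6)
t = ((0.5, 0.2), (0.1, 0.2))
t_transposed = ((0.5, 0.1), (0.2, 0.2))
```

With these definitions the midpoint maps to the origin, independent tables have $x = 0$, and matrix transposition swaps $y$ and $z$.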

Let $\bar{\mathbb{R}} := \mathbb{R} \cup \{-\infty, +\infty\}$ be the two point compactification of
$\mathbb{R}$. $\bar{\mathbb{R}}^3$ is a compactification of $\mathbb{R}^3$ as a cube. We use a shorthand
notation to describe the boundaries, abbreviating $+\infty$ as "+", $-\infty$ as "-" and any finite real
number as "*". The eight vertices $V = \{(\pm\pm\pm)\}$ split into two sets of four:
$V_g = \{(+++), (+--), (-+-), (--+)\}$ and $V_b = \{(---), (-++), (+-+), (++-)\}$.



Proposition 3 (Extension to the borders): $\theta^{-1}$ and $\theta$ considered as set valued functions can be
extended to $\bar{\mathbb{R}}^3$ respectively $\bar{T}$. They remain inverse to each other. The mappings of the
borders can be characterized as follows:

• The vertices $V_g$ together with their respective adjacent edges map to the vertices in $\bar{T}$.

• The faces of $\bar{T}$ correspond to the vertices $V_b$.

• The faces $(\pm * *)$ of the cube map to the diagonal edges $\{p_{00} = p_{11} = 0\}$ and
$\{p_{01} = p_{10} = 0\}$ in $\bar{T}$.

• The faces $(* \pm *)$, $(* * \pm)$ correspond to tables with vanishing rows $\{p_{i.} = 0\}$ or vanishing
columns $\{p_{.j} = 0\}$ in $\bar{T}$ respectively.

This behaviour is illustrated in figure 1. These different compactifications will later be used to
characterise the limit behaviour of association measures. It will turn out that the limit behaviour is
easier to describe using the margin transformation coordinates.



3 Measures of association 

We will now investigate various measures of association between two binary markers. First we define
the objects of interest. 



Definition (Measures of association): A measure of association between binary markers is a continuous
function $\eta : T \to \mathbb{R}$ with the following properties:

a) $\eta$ is zero on independent tables.

b) $\eta$ is a strictly increasing function of the odds-ratio when restricted to tables with fixed margins.

c) $\eta$ respects the symmetry group $D_4$, namely:

c1) $\eta$ is symmetric in the markers, i.e. invariant to matrix transposition.

c2) $\eta$ changes sign when states of a marker are transposed (row or column transposition).

A measure of association is standardised if its range is restricted to $(-1, 1)$.



3.1 Measures based on the odds-ratio 



The odds-ratio $\lambda$ (Edwards 1963):

\[
\lambda = \frac{p_{00}p_{11}}{p_{01}p_{10}}
\]

can be used to define measures of association. As $\lambda$ is $G$-invariant, monotone transformations
automatically fulfil condition b) of the definition.

Standardised measures of association derived from the odds-ratio include Yule's Q (Yule 1900):

\[
Q = \frac{\lambda - 1}{\lambda + 1}
\]

and Yule's Y (Yule 1900):

\[
Y = \frac{\sqrt{\lambda} - 1}{\sqrt{\lambda} + 1}
\]

Obviously, both Q and Y are measures of association. Similar to the odds-ratio, both are extremal if one
of the $p_{ij}$ tends to zero.



3.2 Measures based on additive deviations from independence given the margins

Fixing margins results in a one dimensional submanifold of tables that can be additively parametrised by
a parameter D. All such tables have the form:

\[
\begin{pmatrix} p_{0.}p_{.0} + D & p_{0.}p_{.1} - D \\ p_{1.}p_{.0} - D & p_{1.}p_{.1} + D \end{pmatrix}
\]

$D = p_{00}p_{11} - p_{01}p_{10} = p_{00} - p_{0.}p_{.0}$ describes the additive deviation from the independent
table with the given margins. This measure is zero in case of independence of the markers but extremal values
depend on the margins.




Lewontin's D' (Lewontin 1963): The measure D' is a standardisation of the original measure D:

\[
D' = \frac{D}{D_{\max}} \quad\text{where}\quad D_{\max} =
\begin{cases}
\min\{p_{0.}p_{.1},\ p_{.0}p_{1.}\} & \text{if } D > 0 \\
\min\{p_{0.}p_{.0},\ p_{1.}p_{.1}\} & \text{if } D < 0
\end{cases}
\]

Lewontin's D' ranges from $-1$ to $1$ and tends to these values if at least one of the $p_{ij}$ tends to zero.
D' is widely used in genetics to measure linkage disequilibrium. When a new SNP emerges in a population
by a single mutation event, the new allele is exclusively found in conjunction with only one of the two
alleles of already existing SNPs. As long as no recombination event occurs, the new SNP remains in
complete linkage disequilibrium with the other SNPs. The corresponding 2x2 tables feature a single zero
cell. Thus in this context a measure is needed that is extremal whenever a single entry tends to zero.

Since $D_{\max}$ is constant for tables with fixed margins and D increases with increasing odds-ratio,
D' is a monotone function of the odds-ratio for constant margins. Symmetry is obvious. Hence, D' is a
standardised measure of association.



Correlation coefficient r (Hill & Robertson 1968): The correlation coefficient applied to binary data
is as popular in genetics as D'. It also ranges from $-1$ to $1$ but, in contrast to D', the absolute
value 1 is obtained when a diagonal of $t$ tends to zero:

\[
r = \frac{D}{\sqrt{p_{0.}p_{.0}p_{1.}p_{.1}}} = \frac{p_{00}p_{11} - p_{01}p_{10}}{\sqrt{p_{0.}p_{.0}p_{1.}p_{.1}}}
\]

With reasoning similar to that for D', r is a standardised measure of association.

Proposition 4 (Equality of r, D' and Y on diagonal tables): The measures r, D' and Y
coincide on the set of diagonal tables, i.e. tables with pair-wise equal diagonal elements.

Proof: This follows directly after calculating these measures for the tables
$t = \frac{1}{2a+2b}\begin{pmatrix} a & b \\ b & a \end{pmatrix}$, $a, b > 0$.
□
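Proposition 4 can be illustrated numerically. The following sketch implements Y, D' and r directly from the definitions above; the function names and the example tables are illustrative choices, not taken from the paper.

```python
import math

def margins(t):
    """Return (p0., p1., p.0, p.1) of a table ((p00, p01), (p10, p11))."""
    (p00, p01), (p10, p11) = t
    return p00 + p01, p10 + p11, p00 + p10, p01 + p11

def yule_y(t):
    (p00, p01), (p10, p11) = t
    s = math.sqrt(p00 * p11 / (p01 * p10))  # square root of the odds-ratio
    return (s - 1.0) / (s + 1.0)

def lewontin_d_prime(t):
    (p00, p01), (p10, p11) = t
    r0, r1, c0, c1 = margins(t)
    d = p00 * p11 - p01 * p10
    if d == 0:
        return 0.0
    d_max = min(r0 * c1, c0 * r1) if d > 0 else min(r0 * c0, r1 * c1)
    return d / d_max

def corr_r(t):
    (p00, p01), (p10, p11) = t
    r0, r1, c0, c1 = margins(t)
    return (p00 * p11 - p01 * p10) / math.sqrt(r0 * r1 * c0 * c1)

diagonal = ((0.4, 0.1), (0.1, 0.4))    # a = 0.4, b = 0.1
asymmetric = ((0.5, 0.2), (0.1, 0.2))  # same kind of table without diagonal symmetry
```

For the diagonal table all three measures equal $(a-b)/(a+b) = 0.6$, whereas for the asymmetric table they take three different values.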

3.3 Measures based on information theory 



The mutual information (Weaver & Shannon 1963) is defined as the difference between the information
of the given table and the independent table with the same margins.

\[
\mathrm{MutInf} = \sum_{i,j} p_{ij} \log_2(p_{ij}) - \sum_{i} p_{i.} \log_2(p_{i.}) - \sum_{j} p_{.j} \log_2(p_{.j})
\]

MutInf takes values only in $[0, 1]$. In order to make it a measure of association according to our definition,
we define a signed version:

\[
\mathrm{sMutInf} = \operatorname{sign}(D) \cdot \mathrm{MutInf}
\]



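Proposition 5: sMutInf is a standardised measure of association.

Proof: The symmetry of this measure is clear. To show that sMutInf is a monotone function of the
odds-ratio, we consider the tables
$t_\varepsilon = \begin{pmatrix} p_{00}+\varepsilon & p_{01}-\varepsilon \\ p_{10}-\varepsilon & p_{11}+\varepsilon \end{pmatrix}$
for sufficiently small $\varepsilon > 0$. These tables have the same margins as the table
$t = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}$ but higher odds-ratios. Assuming that
$\lambda > 1$, we see that
$\left.\frac{\partial}{\partial\varepsilon}\,\mathrm{sMutInf}(t_\varepsilon)\right|_{\varepsilon=0} = \log_2 \lambda > 0$.
Hence sMutInf is monotone, and thus a measure of association. □

MutInf approaches 1 only if approaching $\begin{pmatrix} 1/2 & 0 \\ 0 & 1/2 \end{pmatrix}$, while r approaches 1
by approaching tables of the form $\begin{pmatrix} a & 0 \\ 0 & b \end{pmatrix}$, $a, b > 0$.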
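The derivative used in the proof of Proposition 5 can be checked with a finite difference. This is a minimal sketch under the definitions above; `mut_inf` and `s_mut_inf` are illustrative names, and the step size `eps` is an arbitrary choice.

```python
import math

def mut_inf(t):
    """Mutual information (bits) of a table ((p00, p01), (p10, p11))."""
    rows = (t[0][0] + t[0][1], t[1][0] + t[1][1])
    cols = (t[0][0] + t[1][0], t[0][1] + t[1][1])
    total = 0.0
    for i, row in enumerate(t):
        for j, p in enumerate(row):
            if p > 0:  # zero cells contribute nothing
                total += p * math.log2(p / (rows[i] * cols[j]))
    return total

def s_mut_inf(t):
    """Signed mutual information: sign(D) * MutInf."""
    (p00, p01), (p10, p11) = t
    d = p00 * p11 - p01 * p10
    return math.copysign(mut_inf(t), d) if d != 0 else 0.0

t = ((0.5, 0.2), (0.1, 0.2))  # odds-ratio 5
eps = 1e-6
t_eps = ((0.5 + eps, 0.2 - eps), (0.1 - eps, 0.2 + eps))  # same margins, higher odds-ratio
slope = (s_mut_inf(t_eps) - s_mut_inf(t)) / eps           # approximates log2(5)
```

The finite-difference slope agrees with $\log_2 \lambda$ up to the discretisation error, and the table with entries 1/2 on the main diagonal attains the value 1.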

3.4 Counter example 



Kappa coefficient (Cohen 1960): The Kappa coefficient, which is useful in quantifying the agreement
between two raters, is defined as:

\[
\kappa = \frac{p_{00} + p_{11} - p_{0.}p_{.0} - p_{1.}p_{.1}}{1 - p_{0.}p_{.0} - p_{1.}p_{.1}}
\]

Kappa is not a measure of association. Although it fulfils the condition of monotonicity, it is not
symmetric in the sense of condition c2): transposing the states of one marker does not merely change its sign.
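A small numerical example makes the failure of the symmetry condition concrete. The sketch below uses an arbitrarily chosen table; `kappa` is an illustrative helper implementing the formula above.

```python
def kappa(t):
    """Cohen's kappa of a table ((p00, p01), (p10, p11))."""
    (p00, p01), (p10, p11) = t
    agree = p00 + p11
    chance = (p00 + p01) * (p00 + p10) + (p10 + p11) * (p01 + p11)
    return (agree - chance) / (1.0 - chance)

t = ((0.5, 0.2), (0.1, 0.2))
row_swapped = (t[1], t[0])                                 # states of the row marker transposed
transposed = ((t[0][0], t[1][0]), (t[0][1], t[1][1]))      # matrix transposition
```

For this table kappa is $0.16/0.46 \approx 0.348$, and it is unchanged by matrix transposition. After swapping the rows it becomes $-0.16/0.54 \approx -0.296$, which is not the negative of the original value.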

4 Comparing measures of association 

We use the coordinates introduced in Proposition 2 in order to describe and visualise how measures of
association depend on the margins. In particular we study measures of association $\eta$ restricted to
$x$ = const, i.e. for fixed odds-ratios. The restricted functions will be denoted $\eta_x$ and called margin
weighting functions. We characterise the shape of the margin weighting functions and study their limiting
behaviours and extensibility to the compactification $\bar{\mathbb{R}}^3$ in comparison to $\bar{T}$.

The association measure r expressed in margin transformation coordinates reads:

\[
r(x,y,z) = \frac{(e^{2x} - 1)\, e^{y+z}}{\sqrt{(e^{x+y+z} + e^{y})(e^{x+y+z} + e^{z})(e^{x} + e^{y})(e^{x} + e^{z})}} \qquad (3)
\]

The margin weighting function of r for odds-ratio $\lambda = 40$ is shown in figure 2.
Proposition 6 (Margin weighting function for r): For all $x \in \mathbb{R} \setminus \{0\}$:

a) $r_x$ has exactly one extremum at the origin $(y,z) = (0,0)$, corresponding to the diagonal symmetric
table with the fixed odds-ratio.

b) $\lim_{\|(y,z)\| \to \infty} r_x = 0$.

c) $\lim_{x \to \pm\infty} r_x = \pm 1$.

d) r can be extended to $\bar{\mathbb{R}}^3$ except for the lines $(\pm, \pm, *)$ and $(\pm, *, \pm)$ and the
vertices $V$.

e) r can be extended to $\bar{T}$ except for the vertices.

Proof: see appendix.

The measure r down-weights tables with skewed margins.
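The down-weighting can be made tangible by evaluating r on the plane of constant odds-ratio. The sketch below assumes the coordinate expression $r = (e^{2x}-1)e^{y+z} / \sqrt{(e^{x+y+z}+e^y)(e^{x+y+z}+e^z)(e^x+e^y)(e^x+e^z)}$; the grid and the sample points are arbitrary illustrative choices.

```python
import math

def r_coords(x, y, z):
    """Correlation coefficient r in margin transformation coordinates."""
    e = math.exp
    num = (e(2 * x) - 1.0) * e(y + z)
    den = math.sqrt((e(x + y + z) + e(y)) * (e(x + y + z) + e(z))
                    * (e(x) + e(y)) * (e(x) + e(z)))
    return num / den

x40 = 0.5 * math.log(40)                      # plane of odds-ratio 40
grid = [i * 0.5 for i in range(-6, 7)]        # y and z from -3 to 3
best = max((r_coords(x40, y, z), y, z) for y in grid for z in grid)
skewed = r_coords(x40, 10.0, -10.0)           # table with p01 close to 1
```

On this grid the largest value is attained at the origin, where r equals $\tanh(x/2)$, and the strongly skewed table receives a value close to zero.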
The association measure D' expressed in margin transformation coordinates reads:

\[
D'(x,y,z) = \frac{(e^{2x} - 1)\, e^{y+z}}{D_{\max}} \quad\text{where}\quad
D_{\max} =
\begin{cases}
(e^{x+y+z} + e^{y})(e^{x} + e^{y}) & x > 0,\ y < z \\
(e^{x+y+z} + e^{z})(e^{x} + e^{z}) & x > 0,\ y > z \\
(e^{x+y+z} + e^{y})(e^{x+y+z} + e^{z}) & x < 0,\ y < -z \\
(e^{x} + e^{y})(e^{x} + e^{z}) & x < 0,\ y > -z
\end{cases} \qquad (4)
\]

The margin weighting function of D' for odds-ratio $\lambda = 40$ is shown in figure 3.



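Proposition 7 (Margin weighting function for D'): For all $x \in \mathbb{R} \setminus \{0\}$:

a) $D'_x$ has a non-differentiable edge along the diagonal $y = z$ for $D' > 0$ and along the diagonal
$y = -z$ for $D' < 0$. There is a non-smooth saddle point in the origin.

b)
\[
\lim_{y \to \pm\infty} D'_x =
\begin{cases}
(e^{2x} - 1)\,(e^{2x} + e^{x \pm z})^{-1} & : x > 0 \\
(e^{2x} - 1)\,(1 + e^{x \mp z})^{-1} & : x < 0
\end{cases}
\qquad
\lim_{z \to \pm\infty} D'_x =
\begin{cases}
(e^{2x} - 1)\,(e^{2x} + e^{x \pm y})^{-1} & : x > 0 \\
(e^{2x} - 1)\,(1 + e^{x \mp y})^{-1} & : x < 0
\end{cases}
\]

Thus, limit functions have a range of $(0, 1 - e^{-2x})$ for $x > 0$ and $(e^{2x} - 1, 0)$ for $x < 0$, where
0 is obtained for $y \to \pm\infty$, $z \to \pm\infty$, $x > 0$ and $y \to \mp\infty$, $z \to \pm\infty$, $x < 0$.

c) $\lim_{x \to \pm\infty} D'_x = \pm 1$.

d) D' can be extended to $\bar{\mathbb{R}}^3$ except for the vertices $V_g$.

e) D' can be extended to $\bar{T}$ except for the edges and vertices.

Proof: see appendix.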



D' gives higher weights to certain tables without diagonal symmetry. The measure up-weights or
down-weights tables with skewed margins depending on the position of zeros which occur in the limiting tables
(see figure 3). Comparing d) and e) one recognizes that the introduction of the odds-ratio as coordinate
allows extending D' to limit tables with vanishing columns or rows.
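The case distinction for D' in coordinates can be cross-checked against the definition on probability tables. This sketch assumes the four-case form of $D_{\max}$ (minimum of the two relevant margin products, scaled by the squared normalising constant); both function names are illustrative.

```python
import math

def d_prime_coords(x, y, z):
    """D' from the four-case formula in margin transformation coordinates."""
    if x == 0:
        return 0.0
    e = math.exp
    num = (e(2 * x) - 1.0) * e(y + z)
    if x > 0:
        d_max = ((e(x + y + z) + e(y)) * (e(x) + e(y)) if y < z
                 else (e(x + y + z) + e(z)) * (e(x) + e(z)))
    else:
        d_max = ((e(x + y + z) + e(y)) * (e(x + y + z) + e(z)) if y < -z
                 else (e(x) + e(y)) * (e(x) + e(z)))
    return num / d_max

def d_prime_table(x, y, z):
    """D' computed directly from the probability table with coordinates (x, y, z)."""
    e = math.exp
    w00, w01, w10, w11 = e(x + y + z), e(y), e(z), e(x)
    s = w00 + w01 + w10 + w11
    p00, p01, p10, p11 = w00 / s, w01 / s, w10 / s, w11 / s
    r0, r1, c0, c1 = p00 + p01, p10 + p11, p00 + p10, p01 + p11
    d = p00 * p11 - p01 * p10
    if d == 0:
        return 0.0
    d_max = min(r0 * c1, c0 * r1) if d > 0 else min(r0 * c0, r1 * c1)
    return d / d_max

x40 = 0.5 * math.log(40)
junk_p01 = d_prime_coords(x40, 10.0, -10.0)  # limiting table with p01 close to 1
junk_p00 = d_prime_coords(x40, 10.0, 10.0)   # limiting table with p00 close to 1
```

At odds-ratio 40 the table concentrated on $p_{01}$ receives a value of about $39/40$, while the table concentrated on $p_{00}$ receives a value close to zero, which illustrates the dependence on the position of the zeros.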

The association measure sMutInf can also be written in margin transformation coordinates but this
is skipped due to the lengthy formula. The margin weighting function of sMutInf for odds-ratio
$\lambda = 40$ is shown in figure 4.

Proposition 8 (Margin weighting function for sMutInf): For all $x \in \mathbb{R} \setminus \{0\}$:

a) $\mathrm{sMutInf}_x$ has exactly one maximum at the origin $(y,z) = (0,0)$.

b) $\lim_{\|(y,z)\| \to \infty} \mathrm{sMutInf}_x = 0$.

c) $\lim_{x \to \pm\infty} \mathrm{sMutInf}_x = \pm\left( \log_2\left(e^{y \pm z} + 1\right) - \frac{(y \pm z)\, e^{y \pm z}}{e^{y \pm z} + 1} \log_2 e \right)$.
Thus $\mathrm{sMutInf}_x \to \pm 1$ for $y = \mp z$ and $x \to \pm\infty$ respectively.

d) sMutInf can be extended to $\bar{\mathbb{R}}^3$ except for the vertices $V_b$.

e) sMutInf can be extended completely to $\bar{T}$.

Proof: see appendix.

Thus, similarly to r, sMutInf down-weights tables with skewed margins (see figure 4).
The association measure Y in margin transformation coordinates can be simply written as:

\[
Y(x,y,z) = \tanh\frac{x}{2} \qquad (5)
\]

Proposition 9 (Margin weighting function for Y): For all $x \in \mathbb{R}$:

a) $Y_x$ is constant.

b) $\lim_{\|(y,z)\| \to \infty} Y_x = \tanh\frac{x}{2}$.

c) $\lim_{x \to \pm\infty} Y_x = \pm 1$.

d) Y can be extended completely to $\bar{\mathbb{R}}^3$.

e) Y can be extended to $\bar{T}$ except for edges and vertices corresponding to vanishing rows or columns.

Proof: is trivial. □
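For completeness, the coordinate form of Y follows in one line from the definition of Yule's Y, using that the $x$ coordinate is the logarithm of the square root of the odds-ratio, so that $\sqrt{\lambda} = e^{x}$:

```latex
\[
Y = \frac{\sqrt{\lambda} - 1}{\sqrt{\lambda} + 1}
  = \frac{e^{x} - 1}{e^{x} + 1}
  = \frac{e^{x/2} - e^{-x/2}}{e^{x/2} + e^{-x/2}}
  = \tanh\frac{x}{2}.
\]
```

Since the right-hand side does not involve $y$ or $z$, the margin weighting function of Y is constant on every plane of fixed odds-ratio.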



5 Entropy 

Among tables of a fixed odds-ratio, we look for a principled approach to prefer interesting tables and 
down-weight obscure "junk" tables. As a candidate we study the table entropy on T. The entropy 
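function $H : T \to \mathbb{R}$ is defined as the negative expectation of the log-likelihood of the tables:

\[
H\left( \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \right) =
-\left( p_{00} \log_2(p_{00}) + p_{01} \log_2(p_{01}) + p_{10} \log_2(p_{10}) + p_{11} \log_2(p_{11}) \right)
\]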

Why is entropy a candidate to select among tables? It can be characterised in multiple ways: For general
finite discrete distributions the entropy was introduced by Shannon (1948). Shannon
characterised it by a set of postulates to measure the uncertainty in a discrete distribution:

Shannon's characterisation of Entropy: If functions $H_n(p_1, \ldots, p_n)$ with $p_i \ge 0$,
$\sum p_i = 1$, $n \ge 2$ satisfy the conditions

a) $H_2(p, 1-p)$ is a continuous positive function of $p$.

b) $H_n(p_1, \ldots, p_n)$ is symmetric, i.e. invariant under permutations of the $p_1, \ldots, p_n$ for all $n$.

c) $H_n(p_1, \ldots, p_n) = H_{n-1}(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2) \cdot H_2\left( \frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2} \right)$

then $H_n(p_1, \ldots, p_n) = -K \cdot \sum p_i \log_2(p_i)$ for some $K > 0$.

Tables with high entropy are interesting as they have high uncertainty and "surprise value".



Jaynes (Jaynes 2003) gives an independent combinatorial characterisation: When we sample sequentially
from a table $t \in T$ we obtain a vector of observations of length $N$, which we summarise as a frequency table
$t_N = \frac{1}{N} \begin{pmatrix} n_{00} & n_{01} \\ n_{10} & n_{11} \end{pmatrix}$. Each frequency table is
characterised by the number $W(t_N) = \frac{N!}{n_{00}!\, n_{01}!\, n_{10}!\, n_{11}!}$ of sequences which
realise $t_N$. Intuitively, tables that can be realised in multiple ways are more plausible
than those that can be realised only by few sequences. We can use Stirling's formula for $n!$ to approximate
$W(t_N)$. In the limit $N \to \infty$, $t_N \to t$ in probability and
$\frac{1}{N} \log_2(W(t_N)) \to H(t)$. Thus the entropy describes the combinatorial plausibility of a table.
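The convergence of the normalised log-multiplicity to the entropy can be observed numerically. The sketch below is illustrative only: it uses the example table $(0.4, 0.3, 0.2, 0.1)$, exact integer counts for $N = 10, 1000, 100000$, and `math.lgamma` to evaluate the factorials on the log scale.

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a finite distribution p."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def log2_multiplicity(counts):
    """log2 of N! / (n00! n01! n10! n11!), computed via lgamma to avoid overflow."""
    n = sum(counts)
    ln_w = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_w / math.log(2)

base = (4, 3, 2, 1)                       # counts for N = 10, i.e. t = (0.4, 0.3, 0.2, 0.1)
h = entropy_bits([c / 10 for c in base])  # entropy of the underlying table
gaps = []
for k in (1, 100, 10000):                 # N = 10, 1000, 100000
    counts = [c * k for c in base]
    gaps.append(h - log2_multiplicity(counts) / (10 * k))
```

The entries of `gaps` are positive and shrink as $N$ grows, in line with the Stirling approximation.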



Given a set of distributions fulfilling certain constraints, Jaynes (Jaynes 2003) proposes to pick the
corresponding maximum entropy distribution as the most uncommitted and prototypical distribution.
Looking at the margin weighting function of the entropy leads to a surprise:

Recall that Lambert's W-function is defined as the inverse function to $x \mapsto x \exp x$. $W$ is a
multi-branch function since $y = x \exp(x)$ has two solutions for $y \in (-1/e, 0)$. We can prove the following:

Theorem 1 (magic odds-ratio): Define the "magic odds-ratio" by $L_{magic} = W(1/e)^{-2} \approx 12.89$.
Let $L > 1$. The entropy $H$ restricted to the submanifold of constant odds-ratio $L$ in $T$

• has a single maximum at the diagonal table of odds-ratio $L$ if $1 < L < L_{magic}$.

• has a saddle point at the diagonal table of odds-ratio $L$ and two "L-shaped" tables as maxima which
transpose with matrix transposition if $L_{magic} < L$.

"L-shaped" means that for $L \to \infty$ one of the maxima approaches the table
$\begin{pmatrix} 1/3 & 1/3 \\ 0 & 1/3 \end{pmatrix}$. In the case $L < 1$
a similar result can be derived by transposing principal and secondary diagonals.

Proof: There are two constraints to be considered, one of them not linear in $p_{ij}$:

\[
\ln(p_{00}) - \ln(p_{01}) - \ln(p_{10}) + \ln(p_{11}) = \ln(L) \qquad (6)
\]
\[
p_{00} + p_{01} + p_{10} + p_{11} = 1 \qquad (7)
\]

Using Lagrange multipliers, critical tables of $H$ restricted to odds-ratio equal to $L$ can be expressed in
terms of Lambert's W function. The bifurcation occurs for $L_{magic} < L$ because Lambert's W is
multi-branched. See appendix for details.

This theorem suggests that the "magic odds-ratio" is a natural cutpoint between weak and strong
association. For weak association $L < L_{magic}$, interesting tables are those near DS. For strong association
$L_{magic} < L$, particularly interesting tables are those that approach "L-shape", i.e. those in which one
cell differs in magnitude from the three others.
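The value of the magic odds-ratio and the change of behaviour at the diagonal table can be reproduced with a few lines. This is a sketch under stated assumptions: `lambert_w0` is a simple Newton iteration for the principal branch written for this illustration (not a library routine), and the entropy is evaluated at tables parametrised by the odds-ratio and the margin coordinates $(y, z)$.

```python
import math

def lambert_w0(y, tol=1e-15):
    """Principal branch of Lambert's W for y >= 0, by Newton iteration on w*exp(w) = y."""
    w = math.log1p(y)  # starting value
    for _ in range(100):
        step = (w * math.exp(w) - y) / (math.exp(w) * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w

def entropy_on_orbit(odds, y, z):
    """Entropy (bits) of the table with the given odds-ratio and margin coordinates (y, z)."""
    x = 0.5 * math.log(odds)
    w = (math.exp(x + y + z), math.exp(y), math.exp(z), math.exp(x))
    s = sum(w)
    return -sum(v / s * math.log2(v / s) for v in w)

W = lambert_w0(1.0 / math.e)
L_MAGIC = W ** -2             # approximately 12.9
x40 = 0.5 * math.log(40)
```

Below the magic odds-ratio (for example at odds-ratio 5) the diagonal table has higher entropy than its neighbours along $y = -z$; above it (for example at odds-ratio 40) moving along $y = -z$ towards an L-shaped table increases the entropy.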



6 An entropy-based measure of association 

Using these insights on the entropy of a table, in this section we aim to define a measure of association
with similar properties to D' and Y but better limit behaviour, i.e. the measure should down-weight tables
with almost vanishing rows or columns or with mass concentrated on a single entry. These tables are denoted
as junk tables in the following. We have seen in the last sections that D' and Y can be large for these tables.

We also recall that both D' and Y become extremal if the table features a single entry equal to
zero, while r and sMutInf require a vanishing diagonal. We would like to retain this property for the new
measure to be defined. Another feature to be retained is the agreement of measures for diagonal tables, which
holds for Y, D' and r.

According to our definition, an important property of a measure of association is that it is a monotone
function of the odds-ratio when the margins are kept fixed. For the entropy, one can prove the
following lemma:

Lemma 1 (Monotony of the entropy difference): Let $H$ be the entropy of $t$ and $H_{diag}$ be the
entropy of the corresponding diagonal table of the same odds-ratio $\lambda$. Then, $H_{diag} - H$ is
monotonically decreasing for increasing $\lambda > 1$ and constant margins.

Proof: see appendix. 

As a direct consequence of this lemma, it is easy to see that:
Corollary:

\[
HS_n := \operatorname{sign}(Y) \cdot |Y|^{\exp\left(n \cdot (H_{diag} - H)\right)} \qquad (8)
\]

is a measure of association for arbitrary $n > 0$.

This newly defined measure fulfils all above mentioned properties: It coincides with Y, D', r at
diagonal tables, is extremal for tables with a single zero, up-weights L-shaped tables for large odds-ratios in
the sense that $HS_n > Y$ and down-weights junk tables in the sense that $HS_n < Y$ at the margins (proof
see below). However, the down-weighting is imperfect as $HS_n > 0$ for junk tables.

The parameter $n$ can be chosen in order to define the degree of up- and down-weighting. According
to our observations, $n = 4$ is a reasonable choice resulting in a satisfactory down-weighting of junk tables
(see later).

The measure $HS_n$ can be written in margin transformation coordinates using

\[
H_{diag}(x,y,z) = 1 + \log_2(1 + e^{x}) - \frac{x}{\ln 2\, (e^{-x} + 1)}
\]

\[
H(x,y,z) = \log_2\left(e^{x+y+z} + e^{x} + e^{y} + e^{z}\right)
- \frac{(x+y+z)\, e^{x+y+z} + x e^{x} + y e^{y} + z e^{z}}{\ln 2\, \left(e^{x+y+z} + e^{x} + e^{y} + e^{z}\right)}
\]

In figure 5 we present the margin weighting functions of $HS_n$ for $\lambda = 5$ and $\lambda = 40$. These
functions can be easily characterised using the results of the previous section:



Proposition 10 (Margin weighting function for $HS_n$): For all $x \in \mathbb{R} \setminus \{0\}$:

a) For $x \in (-1 - W(1/e),\ 1 + W(1/e))$, $HS_{n,x}$ has exactly one maximum at the origin $(y,z) = (0,0)$. If
$x < -1 - W(1/e)$ or $x > 1 + W(1/e)$, $HS_{n,x}$ has a saddle point at the origin and two extrema elsewhere.
At these extrema, the elements of one diagonal are equal while at the other diagonal there is one (small)
element.

b) $HS_{n,x}$ has the following limit functions

\[
\lim_{y \to \pm\infty \text{ or } z \to \pm\infty} HS_{n,x} = \operatorname{sign}\left(\tanh\frac{x}{2}\right) \cdot
\left|\tanh\frac{x}{2}\right|^{\exp\left( n \left( 1 + \log_2(1 + e^{x}) - \frac{x}{\ln 2\,(e^{-x}+1)} + p \log_2 p + (1-p)\log_2(1-p) \right) \right)}
\]

where $p = \left(1 + e^{(x \pm z)}\right)^{-1}$ for $y \to \pm\infty$ or $p = \left(1 + e^{(x \pm y)}\right)^{-1}$
for $z \to \pm\infty$ respectively. Thus, the limit functions have an extremum at $p = 0.5$, that is
$z = \mp x$ for $y \to \pm\infty$ and $y = \mp x$ for $z \to \pm\infty$ respectively.

c) $\lim_{x \to \pm\infty} HS_{n,x} = \pm 1$.

d) $HS_{n,x} < Y$ at the margins, i.e. $HS_n$ down-weights junk tables.

e) $HS_n$ can be extended completely to $\bar{\mathbb{R}}^3$.

f) $HS_n$ can be extended to $\bar{T}$ except for edges and vertices corresponding to vanishing rows or columns.

g) For all $x \in \mathbb{R}$, $HS_n$ coincides with Y, D', r at diagonal tables.

Proof: a) follows from Theorem 1. b) is easy to see taking the limit of the tables first. c) is
clear since $\lim_{x \to \pm\infty} \tanh\frac{x}{2} = \pm 1$ and the exponent is finite. d) holds since
$H_{diag} > 1$ and $H < 1$ at the margins of finite $x$. e) and f) are consequences of b) and c). g) is obvious. □
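The qualitative statements of Proposition 10 can be checked numerically. The sketch below assumes the reading $HS_n = \operatorname{sign}(Y)\,|Y|^{\exp(n (H_{diag} - H))}$, which is the form consistent with the limit function in b); the function name `hs` and the chosen test points are illustrative.

```python
import math

def hs(x, y, z, n=4):
    """Entropy-weighted measure HS_n in margin transformation coordinates (assumed form)."""
    e = math.exp
    w = (e(x + y + z), e(y), e(z), e(x))
    s = sum(w)
    h = -sum(v / s * math.log2(v / s) for v in w)            # entropy of the table
    h_diag = 1.0 + math.log2(1.0 + e(x)) - x / (math.log(2) * (e(-x) + 1.0))
    yule = math.tanh(x / 2.0)                                 # Yule's Y
    if yule == 0:
        return 0.0
    return math.copysign(abs(yule) ** math.exp(n * (h_diag - h)), yule)

x40 = 0.5 * math.log(40)
y40 = math.tanh(x40 / 2.0)
on_diagonal = hs(x40, 0.0, 0.0)     # coincides with Y
l_shaped = hs(x40, x40, -x40)       # three equal cells, one small cell
junk_half = hs(x40, 10.0, -x40)     # p00 and p01 both close to 0.5
junk_single = hs(x40, 10.0, -10.0)  # p01 close to 1
```

At odds-ratio 40 the L-shaped table is up-weighted relative to Y, while both junk tables are down-weighted but remain strictly positive, which reflects the imperfect down-weighting noted in the text.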



7 Examples of tables and corresponding association measures 

We now study the behaviour of the measures Y, r, D' and the newly proposed measure $HS_4$ for a variety
of selected tables (see table 1). For this purpose, we study the odds-ratios
$\lambda \in \{1, 2, 5, 10, 20, 50, 100\}$ and consider the following tables for $x = \ln\sqrt{\lambda}$:

• The diagonal table ($y = z = 0$).

• An L-shaped table, characterized by $y = x$, $z = -x$.

• A junk table with $y = 10$, $z = -y$ corresponding to $p_{01} \approx 1$.

• A junk table with $y = 10$, $z = -x$ corresponding to $p_{00} \approx p_{01} \approx 0.5$.

• A junk table with $y = 10$, $z = y$ corresponding to $p_{00} \approx 1$.

We also remark that the table with three equal entries has maximum entropy if $\lambda \to \infty$.

By definition of a measure, for $\lambda = 1$ all measures equal zero independently of the concrete realisation
of the table. Since Y is based on the odds-ratio, Y is constant for all tables of the same odds-ratio. Y, r,
D' and $HS_4$ always coincide at diagonal tables. r is maximal at diagonal tables and becomes small for all
kinds of junk tables. D' is always greater for L-shaped tables than for diagonal tables. D' is close to zero
in case of $p_{00} \approx 1$ but can become large for $p_{01} \approx 1$, which is highly counter-intuitive.
$HS_4$ also becomes larger for L-shaped tables compared to diagonal tables if $\lambda$ is large. In contrast
to D', $HS_4$ is close to zero for both junk configurations $p_{00} \approx 1$ and $p_{01} \approx 1$
respectively. The limit tables have a maximum of the entropy at $p_{00} = p_{01} = 0.5$. This induces a maximum
of $HS_4$ for limit tables which increases with $\lambda$ (see table 1, fourth rows of each odds-ratio).
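The five example tables and the four measures can be generated programmatically. The following sketch is an illustration and does not reproduce the numbers of table 1; it assumes the coordinate convention of Proposition 2 and the form $HS_n = \operatorname{sign}(Y)\,|Y|^{\exp(n(H_{diag} - H))}$, and the dictionary keys are labels chosen here.

```python
import math

def table(x, y, z):
    """Cells (p00, p01, p10, p11) of the table with coordinates (x, y, z)."""
    w = (math.exp(x + y + z), math.exp(y), math.exp(z), math.exp(x))
    s = sum(w)
    return tuple(v / s for v in w)

def measures(x, y, z, n=4):
    """Y, r, D' and HS_n for the table with coordinates (x, y, z)."""
    p00, p01, p10, p11 = table(x, y, z)
    r0, r1, c0, c1 = p00 + p01, p10 + p11, p00 + p10, p01 + p11
    d = p00 * p11 - p01 * p10
    yule = math.tanh(x / 2.0)
    r = d / math.sqrt(r0 * r1 * c0 * c1)
    d_max = min(r0 * c1, c0 * r1) if d > 0 else min(r0 * c0, r1 * c1)
    d_prime = d / d_max if d != 0 else 0.0
    h = -sum(p * math.log2(p) for p in (p00, p01, p10, p11) if p > 0)
    h_diag = 1.0 + math.log2(1.0 + math.exp(x)) - x / (math.log(2) * (math.exp(-x) + 1.0))
    hs = math.copysign(abs(yule) ** math.exp(n * (h_diag - h)), yule) if yule else 0.0
    return {"Y": yule, "r": r, "D'": d_prime, "HS4": hs}

def example_tables(lam):
    """Coordinates of the five example tables for odds-ratio lam."""
    x = 0.5 * math.log(lam)
    return {
        "diagonal": (x, 0.0, 0.0),
        "L-shaped": (x, x, -x),
        "junk p01~1": (x, 10.0, -10.0),
        "junk p00~p01~0.5": (x, 10.0, -x),
        "junk p00~1": (x, 10.0, 10.0),
    }

results = {name: measures(*c) for name, c in example_tables(50).items()}
```

For odds-ratio 50, `results` shows the pattern described above: agreement on the diagonal table, larger D' and $HS_4$ on the L-shaped table, a large D' but negligible $HS_4$ for $p_{01} \approx 1$, and small r for all junk tables.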



8 Discussion 

In this paper we studied measures of association of 2x2 contingency tables. In contrast to traditional 
independence analysis, we asked for the selection of tables which are far away from independence. This 
objective was motivated by the analysis of high-dimensional molecular genetic data such as SNP array 
data in which a high number of 2x2 tables occur from which one would like to select cases of high 
dependence called linkage disequilibrium. 

In contrast to detecting a (moderate) deviation from independence, quantifying the strength of
association can be done in many ways. A large number of possible measures have been proposed in the literature,
which we briefly reviewed. Many of these measures (r, D', Y) agree at diagonal tables. Some of the measures
become extremal for a vanishing diagonal (r, sMutInf) while for others it suffices that a single cell becomes
zero (D', odds-ratio based measures). The measures also differ markedly in cases where one of the rows or
columns of the table becomes small. Since in practice it can hardly be decided for these tables whether
the dependence is strong or not, these tables are not really of interest and are considered as junk tables
here. Nevertheless, the measure D' can become large in these cases. This is undesirable. D' also varies
markedly in a small neighbourhood of the vertices of T.

To study the properties of measures of association, we introduced coordinates on the manifold T
of all tables mapping it to 3-dimensional space such that the x-axis corresponds to the log-square root
of the odds-ratio. We study the measures on the planes of constant odds-ratio, looking at the so
called margin weighting functions. These functions are constant for all measures based on the odds-ratio,
which is known to be independent of the margins of the table. For other measures, these functions
describe the dependence of the measure on the margins for tables with constant odds-ratio. Margin
weighting functions illustrate major properties of association measures. They help in designing new measures
with desired properties, which we demonstrated in the second part of the paper.

The mathematical properties of the margin weighting functions were derived for three measures of
association, namely r, sMutInf and D'. This revealed that r and sMutInf behave very similarly by
up-weighting diagonal tables but down-weighting tables with small rows or columns. In contrast, D' is
not maximal for diagonal tables. Furthermore, it shows a strange weighting behaviour for tables with
small rows and columns, up-weighting or down-weighting these tables depending on the position of
the structural zeros. Such tables occur frequently, e.g. in SNP data. This property also explains why the
estimation problem for D' is not well behaved (Scholz & Hasenclever 2010). On the other hand, D' as
well as odds-ratio based measures are constructed to up-weight tables which feature a single small entry.
These tables represent a prototype of a logical table for which one can conclude the state of the column for
one row but not for the other row. These kinds of tables are interesting in genetical statistics since they
correspond to situations in which no recombinations occurred between two SNPs, i.e. only three of the
four theoretically possible haplotypes are observed. Therefore, we aimed to define an alternative measure
also highlighting L-shaped tables but with a better behaviour at the margins than D' or odds-ratio based
measures.

For this purpose, the entropy (Shannon 1948) as another canonical structure on 2x2 tables was
studied. We proved that the margin weighting function of this quantity is maximal at the diagonal for
odds-ratios within a critical range, namely $\left(W(1/e)^{2},\ W(1/e)^{-2}\right)$. Outside this range, there
are two maxima at L-shaped tables, i.e. tables with a single small cell while the others are (almost) equal.
More precisely, the elements of the opposite diagonal are equal for the maxima.

The difference between the entropy of a non-diagonal table and the corresponding diagonal table
of the same odds-ratio is a monotone function of the odds-ratio for fixed margins. A new measure
of association called HS_n is defined, which is essentially Yule's Y weighted by the exponential of this
entropy difference. This quantity fulfils all requirements of an association measure: it ranges between
-1 and 1, is zero in case of independence, is symmetric, and is a monotone function of the odds-ratio for
fixed margins. In addition, it agrees with Y, D' and r at diagonal tables, up-weights tables with an
L-shape and large odds-ratio, and is extremal in case of a single zero in the table. Hence, the measure has
properties similar to those of D' except for a better limit behaviour. Since the entropy of tables with
vanishing row or column is smaller than the entropy of the corresponding diagonal table, degenerated
tables are markedly down-weighted relative to the diagonal table. The free constant n allows tuning the
degree of this down-weighting. For practical purposes we recommend using n = 4, which yields satisfactory
results in our experience. However, our procedure of down-weighting junk tables is neither unique nor
perfect in the sense that the junk tables would be down-weighted to zero. The latter is not possible within
the framework of weighting by entropy without losing other desired properties of the measure, because
the minimum of the absolute differences between the diagonal table and the degenerated tables of the
same odds-ratio approaches zero if the odds-ratio tends to 0 or $\infty$.
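
The agreement of Y, D' and r at diagonal tables, and their disagreement at L-shaped tables, can be checked numerically. The following is a minimal sketch using the standard textbook definitions of the odds-ratio, Yule's Y, D' and the correlation coefficient r; the function name `measures` and the two example tables are illustrative assumptions, and HS_n itself is not implemented here.

```python
import math

def measures(p00, p01, p10, p11):
    """Return (odds_ratio, Y, D_prime, r) for a 2x2 probability table.

    Standard textbook definitions; all four cells are assumed positive.
    """
    lam = (p00 * p11) / (p01 * p10)                   # odds-ratio
    y = (math.sqrt(lam) - 1) / (math.sqrt(lam) + 1)   # Yule's Y
    r0, r1 = p00 + p01, p10 + p11                     # row margins
    c0, c1 = p00 + p10, p01 + p11                     # column margins
    d = p00 * p11 - p01 * p10
    # D is normalised by its maximal absolute value given the margins
    d_max = min(r0 * c1, r1 * c0) if d >= 0 else min(r0 * c0, r1 * c1)
    d_prime = d / d_max
    r = d / math.sqrt(r0 * r1 * c0 * c1)
    return lam, y, d_prime, r

# Diagonal table: p00 = p11 = a, p01 = p10 = c with a + c = 1/2
diag = measures(0.4, 0.1, 0.1, 0.4)
# L-shaped table: one small cell, the other three equal
l_shaped = measures(0.33, 0.01, 0.33, 0.33)
print(diag)
print(l_shaped)
```

For the diagonal table the three measures coincide, whereas for the L-shaped table D' is the largest and r the smallest of the three, which is the behaviour the new measure is designed to moderate.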

We recommend using HS4 instead of D' when interested in selecting L-shaped tables from a large
set of tables mostly far away from independence and when tables with small marginal frequencies are
common. When HS4 is estimated from count data, we recommend using Bayesian plug-in estimators of
the frequencies of single cells, which offer a good compromise between accuracy and computational burden
(Scholz & Hasenclever, 2010).
Appendix

In this section, we prove the propositions and theorems of our paper. In most situations it is sufficient
to consider the case $x > 0$ since, by symmetry, the case $x < 0$ follows analogously.

Proposition 6 (Margin weighting function for r): For all $x \in \mathbb{R} \setminus \{0\}$:

a) $r_x$ has exactly one extremum at the origin $(y,z) = (0,0)$, corresponding to the diagonal symmetric
table with the fixed odds-ratio.

b) $\lim_{\|(y,z)\| \to \infty} r_x = 0$.

c) $\lim_{x \to \pm\infty} r_x = \pm 1$.

d) r can be extended to $\overline{\mathbb{R}}^3$ except for the lines $(\pm, \pm, *)$ and $(\pm, *, \pm)$ and the vertices $V$.

e) r can be extended to $\overline{T}$ except for the vertices.

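Proof: a) We consider the maximum condition for $r_x$ and $x > 0$:

\[
\begin{aligned}
r_x \to \max! \;&\Leftrightarrow\; \frac{1}{\sqrt{(e^{x}+e^{-y})(e^{x}+e^{y})(e^{x}+e^{-z})(e^{x}+e^{z})}} \to \max! \\
&\Leftrightarrow\; (e^{x}+e^{-y})(e^{x}+e^{y})(e^{x}+e^{-z})(e^{x}+e^{z}) \to \min! \\
&\Leftrightarrow\; (e^{x}+e^{-y})(e^{x}+e^{y}) \to \min! \;\wedge\; y = z \\
&\Leftrightarrow\; y = z = 0
\end{aligned}
\]

b) and c) follow easily using equation (3). d) and e) are consequences of b) and c). □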
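
Part a) can be illustrated numerically. The sketch below is hedged: equation (3) is not reproduced in this appendix, so the coordinates are an assumption inferred from the limits in the propositions, namely a table with cells proportional to $e^{x+y+z}, e^{y}, e^{z}, e^{x}$ (odds-ratio $e^{2x}$). The function names are illustrative.

```python
import math

def table(x, y, z):
    """2x2 table in the coordinates assumed here: cells proportional to
    e^(x+y+z), e^y, e^z, e^x, so that the odds-ratio equals e^(2x)."""
    cells = [math.exp(x + y + z), math.exp(y), math.exp(z), math.exp(x)]
    n = sum(cells)
    return [c / n for c in cells]

def r_direct(x, y, z):
    """Correlation coefficient r computed from the cell probabilities."""
    p00, p01, p10, p11 = table(x, y, z)
    d = p00 * p11 - p01 * p10
    return d / math.sqrt((p00 + p01) * (p10 + p11) * (p00 + p10) * (p01 + p11))

def r_closed(x, y, z):
    """Closed form whose denominator is the product minimised in the proof."""
    ex = math.exp(x)
    prod = ((ex + math.exp(-y)) * (ex + math.exp(y))
            * (ex + math.exp(-z)) * (ex + math.exp(z)))
    return (math.exp(2 * x) - 1) / math.sqrt(prod)

x = 1.0
grid = [i / 4 for i in range(-12, 13)]          # y, z in [-3, 3]
best = max((r_direct(x, y, z), y, z) for y in grid for z in grid)
print(best)                                     # maximiser on the grid
```

Under the stated assumption, the grid maximum is attained at the origin and the direct and closed-form evaluations agree.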



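Proposition 7 (Margin weighting function for D'): For all $x \in \mathbb{R} \setminus \{0\}$:

a) $D'_x$ has a non-differentiable edge along the diagonal $y = z$ for $D' > 0$ and along the diagonal $y = -z$
for $D' < 0$. There is a non-smooth saddle point in the origin.

b)
\[
\lim_{y \to \pm\infty} D'_x = \left(e^{2x}-1\right)\cdot
\begin{cases}
\left(e^{2x}+e^{x\pm z}\right)^{-1} & : x > 0 \\
\left(e^{x\mp z}+1\right)^{-1} & : x < 0
\end{cases}
\]
\[
\lim_{z \to \pm\infty} D'_x = \left(e^{2x}-1\right)\cdot
\begin{cases}
\left(e^{2x}+e^{x\pm y}\right)^{-1} & : x > 0 \\
\left(e^{x\mp y}+1\right)^{-1} & : x < 0
\end{cases}
\]

Thus, limit functions have a range of $(0, 1-e^{-2x})$ for $x > 0$ and $(e^{2x}-1, 0)$ for $x < 0$, where 0 is
obtained for $y \to \pm\infty$, $z \to \pm\infty$, $x > 0$ and $y \to \mp\infty$, $z \to \pm\infty$, $x < 0$.

c) $\lim_{x \to \pm\infty} D'_x = \pm 1$.

d) D' can be extended to $\overline{\mathbb{R}}^3$ except for the vertices $V$.

e) D' can be extended to $\overline{T}$ except for the edges and vertices.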

Proof: a) Assume $x > 0$ and consider the path $y = w + c$, $z = w - c$, $w = \text{const}$. Calculating the
left-hand and right-hand derivative of $D_{\max}$ at $c = 0$ using equation (4) yields:

\[
\lim_{c \to 0 \pm 0} \frac{d}{dc} D_{\max} = \mp e^{w}\left(e^{x} + 2e^{w} + e^{x+2w}\right)
\]

Since the term in parentheses is positive, $D_{\max}$ has a wedge at $c = 0$. On the other hand, $D'_x$ has a
maximum at $y = z = 0$ along the path $y = z$ since

\[
\begin{aligned}
D'(y,y) \to \max! \;&\Leftrightarrow\; \frac{e^{2y}}{\left(e^{x+2y}+e^{y}\right)\left(e^{x}+e^{y}\right)} \to \max! \\
&\Leftrightarrow\; \left(e^{x}+e^{-y}\right)\left(e^{x}+e^{y}\right) \to \min! \\
&\Leftrightarrow\; y = 0
\end{aligned}
\]

Hence, D' has a non-differentiable saddle point at $y = z = 0$. The case $x < 0$ follows analogously. The
limit behaviour considered in b) to e) is easy to see using equation (4). □
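
Both the limit in b) and the kink in a) can be checked numerically. This is a sketch under an assumption: equation (4) is not shown here, so D' is evaluated from its standard definition on a table with cells proportional to $e^{x+y+z}, e^{y}, e^{z}, e^{x}$, a parametrisation inferred from the stated limits. The function name `d_prime` is illustrative.

```python
import math

def d_prime(x, y, z):
    """D' for the table with cells proportional to e^(x+y+z), e^y, e^z, e^x
    (assumed coordinates; the odds-ratio is e^(2x))."""
    cells = [math.exp(x + y + z), math.exp(y), math.exp(z), math.exp(x)]
    n = sum(cells)
    p00, p01, p10, p11 = (c / n for c in cells)
    r0, r1, c0, c1 = p00 + p01, p10 + p11, p00 + p10, p01 + p11
    d = p00 * p11 - p01 * p10
    d_max = min(r0 * c1, r1 * c0) if d >= 0 else min(r0 * c0, r1 * c1)
    return d / d_max

x, z = 0.8, 0.3
# Limit for y -> +infinity and x > 0 as stated in part b)
limit = (math.exp(2 * x) - 1) / (math.exp(2 * x) + math.exp(x + z))
far = d_prime(x, 30.0, z)

# One-sided slopes across the diagonal y = z at the origin (path y = c, z = -c)
h = 1e-6
right = (d_prime(x, h, -h) - d_prime(x, 0.0, 0.0)) / h
left = (d_prime(x, 0.0, 0.0) - d_prime(x, -h, h)) / h
print(far, limit, left, right)
```

The two one-sided slopes have opposite signs, so D' is not differentiable across the diagonal at the origin, while the value at large y matches the limit function.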

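Proposition 8 (Margin weighting function for sMutInf): For all $x \in \mathbb{R} \setminus \{0\}$:

a) $\mathrm{sMutInf}_x$ has exactly one maximum at the origin $(y,z) = (0,0)$.

b) $\lim_{\|(y,z)\| \to \infty} \mathrm{sMutInf}_x = 0$.

c) $\lim_{x \to \pm\infty} \mathrm{sMutInf}_x = \pm\left(\log_2\left(e^{y \pm z}+1\right) - \frac{e^{y \pm z}}{e^{y \pm z}+1}\log_2 e^{y \pm z}\right)$.
Thus $\mathrm{sMutInf}_x \to \pm 1$ for $y = \mp z$ and $x \to \pm\infty$ respectively.

d) sMutInf can be extended to $\overline{\mathbb{R}}^3$ except for the vertices $V$.

e) sMutInf can be extended completely to $\overline{T}$.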

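Proof: a) We consider the tables

\[
t_\mu = \frac{1}{N_\mu}\begin{pmatrix} \mu p_{00} & p_{01} \\ p_{10} & p_{11}/\mu \end{pmatrix}
\quad\text{and}\quad
t_\nu = \frac{1}{N_\nu}\begin{pmatrix} p_{00} & \nu p_{01} \\ p_{10}/\nu & p_{11} \end{pmatrix}
\]

of the same odds-ratio as t for $\mu, \nu > 0$, where $N_\mu = \mu p_{00} + p_{01} + p_{10} + p_{11}/\mu$ and
$N_\nu = p_{00} + \nu p_{01} + p_{10}/\nu + p_{11}$ are the normalisation constants. We aim to prove that

\[
\frac{d}{d\mu}\,\mathrm{sMutInf}_x(t_\mu)\Big|_{\mu=1} = 0 \;\Leftrightarrow\; p_{00} = p_{11} \tag{S.1}
\]
\[
\frac{d}{d\nu}\,\mathrm{sMutInf}_x(t_\nu)\Big|_{\nu=1} = 0 \;\Leftrightarrow\; p_{01} = p_{10} \tag{S.2}
\]

Assuming $\lambda > 1$ and $p_{00} < p_{11}$ without restriction of generality, we obtain after some calculations

\[
\begin{aligned}
\frac{d}{d\mu}\,\mathrm{sMutInf}_x(t_\mu)\Big|_{\mu=1}
&= (p_{11}-p_{00})\,\mathrm{sMutInf}_x(t)
 + p_{00}\log_2\frac{p_{00}}{(p_{00}+p_{01})(p_{00}+p_{10})}
 - p_{11}\log_2\frac{p_{11}}{(p_{11}+p_{01})(p_{11}+p_{10})} \\
&= (p_{11}-p_{00})\,\mathrm{sMutInf}_x(t)
 + p_{00}p_{11}\left(\frac{\log_2\left(1+p_{00}\left(\frac{1}{\lambda}-1\right)\right)}{p_{00}}
 - \frac{\log_2\left(1+p_{11}\left(\frac{1}{\lambda}-1\right)\right)}{p_{11}}\right)
\end{aligned}
\tag{S.3}
\]

The first term is non-negative and equals zero iff $p_{00} = p_{11}$.

Consider the monotonicity of the term $\log_2\left(1+x\left(\frac{1}{\lambda}-1\right)\right)/x$ for $x \in (0,1)$. It holds that

\[
\frac{d}{dx}\,\frac{\log_2\left(1+x\left(\frac{1}{\lambda}-1\right)\right)}{x}
= \frac{1}{x^{2}\ln 2}\left(-\ln z + \frac{z-1}{z}\right) \tag{S.4}
\]

with $z = x\left(\frac{1}{\lambda}-1\right)+1 \in (0,1)$. In this interval, (S.4) is negative since $-\ln z + \frac{z-1}{z}$ is monotonically
increasing for $z \in (0,1)$, approaching its supremum 0 for $z \to 1$. Hence $\log_2\left(1+x\left(\frac{1}{\lambda}-1\right)\right)/x$ is monotonically
decreasing for $x \in (0,1)$. In conclusion, the second term of (S.3) is non-negative too and equals zero iff $p_{00} = p_{11}$.
This proves (S.1). Analogously, using $t_\nu$ instead of $t_\mu$ proves (S.2).

b)-e) are obvious exploiting the continuity of the functions involved, i.e. taking the limit of the tables
first. □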
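
The derivative identity (S.3) can be checked by finite differences. The following sketch assumes that sMutInf is the mutual information in bits carrying the sign of the association, so that it equals the plain mutual information for odds-ratio > 1; the example table and function names are illustrative.

```python
import math

def mutual_information(p):
    """Mutual information (bits) of a 2x2 table p = (p00, p01, p10, p11)."""
    p00, p01, p10, p11 = p
    rows = (p00 + p01, p10 + p11)
    cols = (p00 + p10, p01 + p11)
    cells = ((p00, 0, 0), (p01, 0, 1), (p10, 1, 0), (p11, 1, 1))
    return sum(v * math.log2(v / (rows[i] * cols[j])) for v, i, j in cells)

def t_mu(p, mu):
    """Margin transformation: scale p00 by mu, p11 by 1/mu, renormalise."""
    p00, p01, p10, p11 = p
    cells = (mu * p00, p01, p10, p11 / mu)
    n = sum(cells)
    return tuple(c / n for c in cells)

p = (0.2, 0.1, 0.15, 0.55)          # odds-ratio > 1 and p00 < p11
h = 1e-6
# Central difference of the mutual information along t_mu at mu = 1
numeric = (mutual_information(t_mu(p, 1 + h))
           - mutual_information(t_mu(p, 1 - h))) / (2 * h)

p00, p01, p10, p11 = p
# First line of (S.3)
closed = ((p11 - p00) * mutual_information(p)
          + p00 * math.log2(p00 / ((p00 + p01) * (p00 + p10)))
          - p11 * math.log2(p11 / ((p11 + p01) * (p11 + p10))))
print(numeric, closed)
```

For this table the derivative is positive, in line with the argument that it vanishes only for $p_{00} = p_{11}$.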



Theorem 1 (magic odds-ratio): Define the "magic odds-ratio" by $L_{magic} = W(1/e)^{-2} \approx 12.896$.
Let $L > 1$. The entropy H restricted to the submanifold of constant odds-ratio L in T

• has a single maximum at the diagonal table of odds-ratio L if $1 < L < L_{magic}$.

• has a saddle point at the diagonal table of odds-ratio L and two "L-shaped" tables as maxima, which
are mapped onto each other by matrix transposition, if $L_{magic} < L$.

"L-shaped" means that for $L \to \infty$ one of the maxima approaches the table $\begin{pmatrix} 1/3 & 0 \\ 1/3 & 1/3 \end{pmatrix}$. For the case $L < 1$
a similar result can be derived by transposing principal and secondary diagonals.
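
The numerical value of the magic odds-ratio can be reproduced with a few lines of code. The sketch below evaluates the principal branch of Lambert's W by Newton's method using only the standard library; the helper name `lambert_w0`, the starting value and the tolerance are illustrative choices, not part of the paper.

```python
import math

def lambert_w0(y, tol=1e-15, max_iter=100):
    """Principal branch of Lambert's W for y >= 0, via Newton's method
    applied to f(w) = w * exp(w) - y."""
    w = math.log1p(y)                      # simple starting value for y >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - y) / (ew * (w + 1))
        w -= step
        if abs(step) < tol:
            break
    return w

w = lambert_w0(1 / math.e)
l_magic = w ** -2                          # magic odds-ratio W(1/e)^(-2)
print(w, l_magic)
```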
Proof: The constraint odds-ratio = L can be written in the form:

\[
\ln(p_{00}) - \ln(p_{01}) - \ln(p_{10}) + \ln(p_{11}) = \ln(L) \tag{S.5}
\]

We assume $L > 1$ in the following without restriction of generality since the case $L < 1$ can be studied
analogously. A second constraint is given by

\[
p_{00} + p_{01} + p_{10} + p_{11} = 1 \tag{S.6}
\]

In order to study the critical points of H, we now consider the extremal value problem of H given
the constraints (S.5) and (S.6). For this purpose, we introduce Lagrange multipliers $\lambda_1$ and $\lambda_2$ and
determine the first variation of the following function:

\[
\begin{aligned}
f(t,\lambda_1,\lambda_2) = {} & -\left(p_{00}\ln(p_{00}) + p_{01}\ln(p_{01}) + p_{10}\ln(p_{10}) + p_{11}\ln(p_{11})\right) \\
& + \lambda_1\left(\ln(p_{00}) - \ln(p_{01}) - \ln(p_{10}) + \ln(p_{11}) - \ln(L)\right)
  + \lambda_2\left(p_{00} + p_{01} + p_{10} + p_{11} - 1\right)
\end{aligned}
\tag{S.7}
\]

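Calculating the partial derivatives gives four equations:

\[
\begin{aligned}
\ln(p_{00}) + 1 - \frac{\lambda_1}{p_{00}} - \lambda_2 &= 0 \\
\ln(p_{11}) + 1 - \frac{\lambda_1}{p_{11}} - \lambda_2 &= 0 \\
\ln(p_{10}) + 1 + \frac{\lambda_1}{p_{10}} - \lambda_2 &= 0 \\
\ln(p_{01}) + 1 + \frac{\lambda_1}{p_{01}} - \lambda_2 &= 0
\end{aligned}
\]

In order to solve this system explicitly, we recall that Lambert's W function is defined
as the inverse function to $x\exp(x)$. Hence, it holds that

\[
p_{ij} = \frac{\pm\lambda_1}{W\left(\pm\lambda_1\exp(1-\lambda_2)\right)}
\]

where the upper sign holds for $p_{00}$ and $p_{11}$ and the lower sign for $p_{01}$ and $p_{10}$ respectively. At first glance
it seems that the only solution is the diagonal-symmetric table. But W is a multi-branch function since
$y = x\exp(x)$ has two solutions for $y \in (-1/e, 0)$. The two real-valued branches are traditionally called $W_0$
when $x \in (-1, 0)$ and $W_{-1}$ when $x \in (-\infty, -1)$. Note that $W_{-1}(-1/e) = W_0(-1/e) = -1$.
Assume $\lambda_1 > 0$. Inserting the solutions for $p_{ij}$ in the condition on $\ln(L)$ we get three possible solutions.

a) $\ln(L) = \ln\left(\dfrac{W_0\left(-\lambda_1 e^{1-\lambda_2}\right)^2}{W\left(\lambda_1 e^{1-\lambda_2}\right)^2}\right)$

This solution exists only for $L \in \left(1, W(1/e)^{-2}\right]$, where $W(1/e)^{-2} \approx 12.89615$.

b) $\ln(L) = \ln\left(\dfrac{W_{-1}\left(-\lambda_1 e^{1-\lambda_2}\right)^2}{W\left(\lambda_1 e^{1-\lambda_2}\right)^2}\right)$

This solution exists only for $L \in \left[W(1/e)^{-2}, \infty\right)$.

c) $\ln(L) = \ln\left(\dfrac{W_0\left(-\lambda_1 e^{1-\lambda_2}\right)\,W_{-1}\left(-\lambda_1 e^{1-\lambda_2}\right)}{W\left(\lambda_1 e^{1-\lambda_2}\right)^2}\right)$

These solutions exist only for $L \in \left[W(1/e)^{-2}, \infty\right)$.

Hence, we have a single critical point for $L \in \left(1, W(1/e)^{-2}\right)$ but three critical points for $L \in \left(W(1/e)^{-2}, \infty\right)$.
The next lemma characterises these critical points.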

Lemma (Characterisation of the critical points of H for given odds-ratio):

a) For $L \in \left(1, W(1/e)^{-2}\right]$, H has a maximum at the diagonal table of odds-ratio L.

b) For $L \in \left(W(1/e)^{-2}, \infty\right)$, H has a saddle-point at the diagonal table and two maxima at the other two
critical points. If $L \to \infty$ these maxima tend to the tables $\begin{pmatrix} 1/3 & 0 \\ 1/3 & 1/3 \end{pmatrix}$ and $\begin{pmatrix} 1/3 & 1/3 \\ 0 & 1/3 \end{pmatrix}$ respectively.
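
The change from one maximum to two maxima can be made visible with a coarse grid search over the margin coordinates. This is a sketch only: it assumes the coordinates in which a table has cells proportional to $e^{x+y+z}, e^{y}, e^{z}, e^{x}$ with odds-ratio $e^{2x}$, and the grid range and resolution are arbitrary illustrative choices.

```python
import math

def entropy(x, y, z):
    """Entropy (nats) of the table with cells proportional to
    e^(x+y+z), e^y, e^z, e^x (assumed coordinates, odds-ratio e^(2x))."""
    cells = [math.exp(x + y + z), math.exp(y), math.exp(z), math.exp(x)]
    n = sum(cells)
    return -sum(c / n * math.log(c / n) for c in cells)

def argmax_entropy(odds_ratio, half_width=4.0, steps=80):
    """Return (max entropy, y, z) on a regular grid for a fixed odds-ratio."""
    x = 0.5 * math.log(odds_ratio)
    grid = [-half_width + 2 * half_width * i / steps for i in range(steps + 1)]
    return max((entropy(x, y, z), y, z) for y in grid for z in grid)

best_low = argmax_entropy(5.0)     # below the magic odds-ratio
best_high = argmax_entropy(40.0)   # above the magic odds-ratio
print(best_low)
print(best_high)
```

For the smaller odds-ratio the grid maximum sits at the origin (the diagonal table), while for the larger one it moves away from the origin to a point where y and z have opposite signs, i.e. towards an L-shaped table.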
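Proof: We study the following tables:

\[
t_\mu = \frac{1}{N_\mu}\begin{pmatrix} \mu a & c \\ c & a/\mu \end{pmatrix}
\text{ with } N_\mu = \mu a + a/\mu + 2c
\quad\text{and}\quad
t_\nu = \frac{1}{N_\nu}\begin{pmatrix} a & \nu c \\ c/\nu & a \end{pmatrix}
\text{ with } N_\nu = 2a + \nu c + c/\nu,
\]

$a, c > 0$, $a + c = 1/2$, where $\lambda = (a/c)^2$ denotes the odds-ratio of the diagonal table. We calculate the
second derivative of $H(t_\mu)$ and $H(t_\nu)$ at $\mu = 1$ and $\nu = 1$ respectively. After some calculations one obtains

\[
\frac{d^2}{d\mu^2} H(t_\mu)\Big|_{\mu=1}
= -\frac{1}{\ln 2}\,\frac{\sqrt{\lambda}}{1+\sqrt{\lambda}}\left(1 + \frac{1}{1+\sqrt{\lambda}}\ln\sqrt{\lambda}\right) < 0
\]

In contrast

\[
\frac{d^2}{d\nu^2} H(t_\nu)\Big|_{\nu=1}
= -\frac{1}{\ln 2}\,\frac{1}{1+\sqrt{\lambda}}\left(1 - \frac{\sqrt{\lambda}}{1+\sqrt{\lambda}}\ln\sqrt{\lambda}\right)
\]

which is greater than 0 for $\lambda > W(1/e)^{-2}$. Thus $\begin{pmatrix} a & c \\ c & a \end{pmatrix}$ becomes a saddle point for $\lambda > W(1/e)^{-2}$ but is a
maximum for $\lambda < W(1/e)^{-2}$. The other assertions of the lemma and Theorem 1 are then easy to see. □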
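
The sign change of the curvature in the $\nu$-direction at the magic odds-ratio can be verified numerically. The sketch below works in nats (dropping the constant factor $1/\ln 2$) and compares a closed form for the second derivative at $\nu = 1$, stated here as an assumption, with a second-order finite difference; all names are illustrative.

```python
import math

def entropy_nu(lam, s):
    """Entropy (nats) of the diagonal table of odds-ratio lam after the
    margin transformation nu = exp(s): cells a, nu*c, c/nu, a (renormalised)."""
    u = math.sqrt(lam)
    a, c = u / (2 * (1 + u)), 1 / (2 * (1 + u))
    cells = [a, c * math.exp(s), c * math.exp(-s), a]
    n = sum(cells)
    return -sum(v / n * math.log(v / n) for v in cells)

def curvature_closed(lam):
    """Assumed closed form of the second derivative at nu = 1 (nats)."""
    u = math.sqrt(lam)
    return -(1 / (1 + u)) * (1 - u * math.log(u) / (1 + u))

def curvature_numeric(lam, h=1e-4):
    """Second-order central difference in s = ln(nu); it equals the second
    derivative in nu at nu = 1 because the first derivative vanishes there."""
    return (entropy_nu(lam, h) - 2 * entropy_nu(lam, 0.0)
            + entropy_nu(lam, -h)) / h ** 2

for lam in (5.0, 12.8, 13.0, 40.0):
    print(lam, curvature_closed(lam), curvature_numeric(lam))
```

The curvature is negative below and positive above $W(1/e)^{-2} \approx 12.896$, which is the transition from a maximum to a saddle point.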



Lemma 1 (Monotonicity of the entropy difference): Let H be the entropy of t and $H_{diag}$ be the
entropy of the corresponding diagonal table of the same odds-ratio $\lambda$. Then, $H_{diag} - H$ is monotonically
decreasing for increasing $\lambda > 1$ and constant margins.



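Proof: Let $\varepsilon > 0$ and $t_\varepsilon = \begin{pmatrix} p_{00}+\varepsilon & p_{01}-\varepsilon \\ p_{10}-\varepsilon & p_{11}+\varepsilon \end{pmatrix}$ a table with increased odds-ratio but same margins
compared to t. We show that $\frac{d}{d\varepsilon}\Big|_{\varepsilon=0}\left(H_{diag}(t_\varepsilon) - H(t_\varepsilon)\right) \le 0$, where equality holds iff t is diagonal. After
some calculations we obtain

\[
\frac{d}{d\varepsilon} H_{diag}(t_\varepsilon)\Big|_{\varepsilon=0}
= -\frac{\sqrt{\lambda}\,\log_2\lambda}{4\left(1+\sqrt{\lambda}\right)^2}\sum_{i,j=0}^{1}\frac{1}{p_{ij}} \tag{S.8}
\]
\[
\frac{d}{d\varepsilon} H(t_\varepsilon)\Big|_{\varepsilon=0} = -\log_2\lambda \tag{S.9}
\]

Thus

\[
\frac{d}{d\varepsilon}\left(H_{diag}(t_\varepsilon) - H(t_\varepsilon)\right)\Big|_{\varepsilon=0}
= \log_2\lambda\left(1 - \frac{\sqrt{\lambda}}{4\left(1+\sqrt{\lambda}\right)^2}\sum_{i,j=0}^{1}\frac{1}{p_{ij}}\right) \tag{S.10}
\]

Now consider the tables $t_\mu = \frac{1}{N_\mu}\begin{pmatrix} \mu p_{00} & p_{01} \\ p_{10} & p_{11}/\mu \end{pmatrix}$ and $t_\nu = \frac{1}{N_\nu}\begin{pmatrix} p_{00} & \nu p_{01} \\ p_{10}/\nu & p_{11} \end{pmatrix}$ of the same odds-ratio as t for
$\mu, \nu > 1$ and the normalisation constants $N_\mu = \mu p_{00} + p_{01} + p_{10} + p_{11}/\mu$ and $N_\nu = p_{00} + \nu p_{01} + p_{10}/\nu + p_{11}$
respectively.

Assume $p_{00} \le p_{11}$ and $p_{01} \le p_{10}$ without restriction of generality. We see that for $f(t) = \sum_{i,j=0}^{1} \frac{1}{p_{ij}}$ it
holds that

\[
\frac{d}{d\mu} f(t_\mu)\Big|_{\mu=1} = (p_{00} - p_{11})\left(\sum_{i,j=0}^{1}\frac{1}{p_{ij}} + \frac{1}{p_{00}p_{11}}\right) \le 0
\]
\[
\frac{d}{d\nu} f(t_\nu)\Big|_{\nu=1} = (p_{01} - p_{10})\left(\sum_{i,j=0}^{1}\frac{1}{p_{ij}} + \frac{1}{p_{01}p_{10}}\right) \le 0
\]

where equality holds in both relations iff t is diagonal. Hence the maximum of the term in parentheses of (S.10)
is obtained iff t is diagonal. On the other hand, for t diagonal it holds that

\[
\frac{\sqrt{\lambda}}{4\left(1+\sqrt{\lambda}\right)^2}\sum_{i,j=0}^{1}\frac{1}{p_{ij}} = 1 \tag{S.11}
\]

□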
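
The sign of the derivative of $H_{diag} - H$ along the margin-preserving shift can be checked by finite differences. The sketch below works in nats (the base of the logarithm only rescales both sides) and compares the numerical derivative with a closed form that is stated here as an assumption; the example table and all names are illustrative.

```python
import math

def entropy(p):
    """Entropy in nats of a table given as a tuple of four cells."""
    return -sum(v * math.log(v) for v in p)

def entropy_diag(lam):
    """Entropy of the diagonal table with odds-ratio lam."""
    u = math.sqrt(lam)
    a, c = u / (2 * (1 + u)), 1 / (2 * (1 + u))
    return entropy((a, c, c, a))

def odds_ratio(p):
    return p[0] * p[3] / (p[1] * p[2])

def shift(p, eps):
    """Increase the odds-ratio while keeping both margins fixed."""
    return (p[0] + eps, p[1] - eps, p[2] - eps, p[3] + eps)

def gap(p):
    """Entropy difference H_diag - H for the table p."""
    return entropy_diag(odds_ratio(p)) - entropy(p)

p = (0.2, 0.1, 0.15, 0.55)          # non-diagonal table, odds-ratio > 1
h = 1e-6
numeric = (gap(shift(p, h)) - gap(shift(p, -h))) / (2 * h)

lam = odds_ratio(p)
u = math.sqrt(lam)
# Assumed closed form of the derivative (nats)
closed = math.log(lam) * (1 - u / (4 * (1 + u) ** 2) * sum(1 / v for v in p))
print(numeric, closed)
```

For this non-diagonal table the derivative is negative, consistent with the monotonic decrease claimed in Lemma 1.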
Table 1 (Measures of association for selected tables): Table entries rounded to three decimals
are presented in columns 1 to 4. Normally printed zeros are hard zeros while zeros in italic are values
less than 0.0005. The fifth column presents the odds-ratio of the tables. For each odds-ratio, we studied
five tables: the diagonal table (first row of the corresponding odds-ratio), a table with three equal entries
(second row), a table for which it holds that $p_{01} \approx 1$ or $p_{00} \approx 1$ (third and fifth row respectively) and a
table for which $p_{00} \approx p_{01} \approx 0.5$ (fourth row). The last four columns contain the corresponding measures
of association rounded to three decimals.
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Figure 1: Illustration of the maps 9 and ^ on the boundaries of R"^ and T. represents positive 

number adding up to 1. 



Figure 2: Margin-weighting function of r

Figure 3: Margin-weighting function of D'

Figure 4: Margin-weighting function of sMutInf

Figure 5: Margin-weighting function of HS4 (panels: odds ratio = 5 and odds ratio = 40)